Biostat 212a Homework 6

Due Mar 22, 2025 @ 11:59PM

Author

Palash Raval (UID 406551574)

Published

March 22, 2025

Load R libraries.

library(tidyverse)
library(tidymodels)
library(readr)
library(tswge)
library(ggplot2)
library(yardstick)
library(ranger)
acfdf <- function(vec) {
    vacf <- acf(vec, plot = F)
    with(vacf, data.frame(lag, acf))
}

ggacf <- function(vec) {
    ac <- acfdf(vec)
    ggplot(data = ac, aes(x = lag, y = acf)) + geom_hline(aes(yintercept = 0)) + 
        geom_segment(mapping = aes(xend = lag, yend = 0))
}

tplot <- function(vec) {
    df <- data.frame(X = vec, t = seq_along(vec))
    ggplot(data = df, aes(x = t, y = X)) + geom_line()
}

1 New York Stock Exchange (NYSE) data (1962-1986) (140 pts)

The NYSE.csv file contains three daily time series from the New York Stock Exchange (NYSE) for the period Dec 3, 1962-Dec 31, 1986 (6,051 trading days).

  • Log trading volume (\(v_t\)): This is the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale.

  • Dow Jones return (\(r_t\)): This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days.

  • Log volatility (\(z_t\)): This is based on the absolute values of daily price movements.

# Read in NYSE data from url

url = "https://raw.githubusercontent.com/ucla-biostat-212a/2025winter/master/slides/data/NYSE.csv"
NYSE <- read_csv(url)

NYSE
# A tibble: 6,051 × 6
   date       day_of_week DJ_return log_volume log_volatility train
   <date>     <chr>           <dbl>      <dbl>          <dbl> <lgl>
 1 1962-12-03 mon         -0.00446     0.0326           -13.1 TRUE 
 2 1962-12-04 tues         0.00781     0.346            -11.7 TRUE 
 3 1962-12-05 wed          0.00384     0.525            -11.7 TRUE 
 4 1962-12-06 thur        -0.00346     0.210            -11.6 TRUE 
 5 1962-12-07 fri          0.000568    0.0442           -11.7 TRUE 
 6 1962-12-10 mon         -0.0108      0.133            -10.9 TRUE 
 7 1962-12-11 tues         0.000124   -0.0115           -11.0 TRUE 
 8 1962-12-12 wed          0.00336     0.00161          -11.0 TRUE 
 9 1962-12-13 thur        -0.00330    -0.106            -11.0 TRUE 
10 1962-12-14 fri          0.00447    -0.138            -11.0 TRUE 
# ℹ 6,041 more rows

The autocorrelation at lag \(\ell\) is the correlation of all pairs \((v_t, v_{t-\ell})\) that are \(\ell\) trading days apart. These sizable correlations give us confidence that past values will be helpful in predicting the future.
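
As an illustrative aside (not part of the graded pipeline), the definition above can be checked directly in base R: compute the correlation of the pairs \((v_t, v_{t-\ell})\) by hand on a simulated AR(1) series.

```r
# Compute the lag-ell autocorrelation directly from its definition: the
# correlation of all pairs (v_t, v_{t-ell}). Note that acf() uses a
# slightly different (biased) denominator, so the two agree only
# approximately.
lag_cor <- function(v, ell) {
  n <- length(v)
  cor(v[(ell + 1):n], v[1:(n - ell)])
}
set.seed(1)
v <- as.numeric(arima.sim(list(ar = 0.8), n = 2000))
lag_cor(v, 1)  # close to the AR(1) coefficient, 0.8
```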

Code
ggacf(NYSE$log_volume) + ggthemes::theme_few()
Figure 1: The autocorrelation function for log volume. We see that nearby values are fairly strongly correlated, with correlations above 0.2 as far as 20 days apart.

Do a similar plot for (1) the correlation between \(v_t\) and lag \(\ell\) Dow Jones return \(r_{t-\ell}\) and (2) correlation between \(v_t\) and lag \(\ell\) Log volatility \(z_{t-\ell}\).

seq(1, 30) %>% 
  map(function(x) {cor(NYSE$log_volume , lag(NYSE$DJ_return, x), use = "pairwise.complete.obs")}) %>% 
  unlist() %>% 
  tibble(lag = 1:30, cor = .) %>% 
  ggplot(aes(x = lag, y = cor)) + 
  geom_hline(aes(yintercept = 0)) + 
  geom_segment(mapping = aes(xend = lag, yend = 0)) + 
  ggtitle("Cross-correlation between `log volume` and lagged `DJ return`")

seq(1, 30) %>% 
  map(function(x) {cor(NYSE$log_volume , lag(NYSE$log_volatility, x), use = "pairwise.complete.obs")}) %>% 
  unlist() %>% 
  tibble(lag = 1:30, cor = .) %>% 
  ggplot(aes(x = lag, y = cor)) + 
  geom_hline(aes(yintercept = 0)) + 
  geom_segment(mapping = aes(xend = lag, yend = 0)) + 
  ggtitle("Cross-correlation between `log volume` and lagged `log volatility`")

1.1 Project goal

Our goal is to forecast daily log trading volume using the various machine learning algorithms we learned in this class.

The data set is already split into train (before Jan 1st, 1980, \(n_{\text{train}} = 4,281\)) and test (after Jan 1st, 1980, \(n_{\text{test}} = 1,770\)) sets.

In general, we would tune the lag \(L\) to achieve the best forecasting performance. In this project, we fix \(L = 5\); that is, we always use the previous five trading days’ data to forecast today’s log trading volume.

Pay attention to the nuance of splitting time series data for cross validation. Study and use the time-series functionality in tidymodels. Make sure to use the same splits when tuning different machine learning algorithms.
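
To make the index bookkeeping concrete, the rolling-origin scheme can be sketched in base R; `rolling_indices()` below is a hypothetical helper mimicking what `rsample::rolling_origin()` produces, not part of the pipeline.

```r
# Sketch of rolling-origin resampling indices: each slice trains on a
# window of `initial` days and assesses on the `assess` days right after.
rolling_indices <- function(n, initial, assess) {
  starts <- seq(1, n - initial - assess + 1)
  lapply(starts, function(s) list(
    analysis   = s:(s + initial - 1),
    assessment = (s + initial):(s + initial + assess - 1)
  ))
}
idx <- rolling_indices(n = 12, initial = 5, assess = 2)
idx[[1]]$analysis    # days 1..5
idx[[1]]$assessment  # days 6..7
```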

Use the \(R^2\) between forecast and actual values as the cross validation and test evaluation criterion.
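
One nuance worth noting: yardstick’s `rsq` is the squared Pearson correlation between truth and estimate, which can differ sharply from the “1 − SSE/SST” form when a forecast is biased. A quick toy check:

```r
# Toy numbers: a forecast that is perfectly correlated with the truth but
# badly biased. The squared-correlation R^2 is 1; the 1 - SSE/SST form is
# strongly negative.
truth    <- c(1, 2, 3, 4, 5)
estimate <- truth + 10   # perfectly correlated, badly biased
cor(truth, estimate)^2                                         # 1
1 - sum((truth - estimate)^2) / sum((truth - mean(truth))^2)   # -49
```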

1.2 Baseline method (20 pts)

We use the straw man (use yesterday’s value of log trading volume to predict that of today) as the baseline method. Evaluate the \(R^2\) of this method on the test data.

L = 5

for(i in seq(1, L)) {
  NYSE = NYSE %>% 
    mutate(!!paste("DJ_return_lag", i, sep = "") := lag(NYSE$DJ_return, i),
           !!paste("log_volume_lag", i, sep = "") := lag(NYSE$log_volume, i),
           !!paste("log_volatility_lag", i, sep = "") := lag(NYSE$log_volatility, i))
}

NYSE = NYSE %>% na.omit()
NYSE_training = NYSE %>% 
  filter(train == 'TRUE') %>%
  select(-train) %>%
  drop_na()

dim(NYSE_training)
[1] 4276   20
NYSE_test = NYSE %>% 
  filter(train == 'FALSE') %>%
  select(-train) %>%
  drop_na()

dim(NYSE_test)
[1] 1770   20
r2_test_value = rsq_vec(NYSE_test$log_volume, NYSE_test$log_volume_lag1)

print(paste("Straw man test R2:", r2_test_value))
[1] "Straw man test R2: 0.348124998983674"

1.3 Autoregression (AR) forecaster (30 pts)

  • Let \[ y = \begin{pmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_T \end{pmatrix}, \quad M = \begin{pmatrix} 1 & v_L & v_{L-1} & \cdots & v_1 \\ 1 & v_{L+1} & v_{L} & \cdots & v_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L} \end{pmatrix}. \]

  • Fit an ordinary least squares (OLS) regression of \(y\) on \(M\), giving \[ \hat v_t = \hat \beta_0 + \hat \beta_1 v_{t-1} + \hat \beta_2 v_{t-2} + \cdots + \hat \beta_L v_{t-L}, \] known as an order-\(L\) autoregression model or AR(\(L\)).

  • Tune AR(5) with elastic net (lasso + ridge) regularization using all 3 features on the training data, and evaluate the test performance.

  • Hint: Workflow: Lasso is a good starting point.
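
Before the tidymodels version below, the \(y\)/\(M\) construction can be sketched directly in base R on a toy AR(2) series (illustrative only; the homework fits `log_volume` with glmnet):

```r
# Build y and M exactly as defined above, then fit OLS with lm().
set.seed(212)
v  <- as.numeric(arima.sim(list(ar = c(0.5, 0.2)), n = 1000))
L  <- 5
T_ <- length(v)
M  <- sapply(1:L, function(l) v[(L + 1 - l):(T_ - l)])  # column l holds v_{t-l}
y  <- v[(L + 1):T_]
fit <- lm(y ~ M)       # the intercept plays the role of beta_0
round(coef(fit), 2)    # roughly (0, 0.5, 0.2, 0, 0, 0)
```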

# Re-read the raw NYSE data from url (this resets the lag columns added in
# Section 1.2; the recipes below recreate the lags)

url = "https://raw.githubusercontent.com/ucla-biostat-212a/2025winter/master/slides/data/NYSE.csv"
NYSE <- read_csv(url)

NYSE_training = NYSE %>% 
  filter(train == 'TRUE') %>%
  select(-train) %>%
  drop_na()

dim(NYSE_training)
[1] 4281    5
NYSE_test = NYSE %>% 
  filter(train == 'FALSE') %>%
  select(-train) %>%
  drop_na()

dim(NYSE_test)
[1] 1770    5
NYSE %>%
  ggplot(aes(x = date, y = log_volume)) + 
  geom_line() + 
  geom_smooth(method = "lm")

A random split would scatter test days throughout the training period, leaking future information into training. The plots below contrast a (wrong) random split with a (correct) time-ordered split, and then illustrate rolling-origin and sliding-period resampling schemes.

wrong_split <- initial_split(NYSE_training)

bind_rows(
  training(wrong_split) %>% mutate(type = "train"),
  testing(wrong_split) %>% mutate(type = "test")
) %>% 
  ggplot(aes(x = date, y = log_volume, color = type, group = NA)) + 
  geom_line()

correct_split <- initial_time_split(NYSE_training %>% arrange(date))

bind_rows(
  training(correct_split) %>% mutate(type = "train"),
  testing(correct_split) %>% mutate(type = "test")
) %>% 
  ggplot(aes(x = date, y = log_volume, color = type, group = NA)) + 
  geom_line()

rolling_origin(NYSE_training %>% arrange(date), initial = 30, 
               assess = 5, cumulative = F) %>%
  mutate(train_data = map(splits, analysis),
         test_data = map(splits, assessment)) %>% 
  select(-splits) %>% 
  pivot_longer(-id) %>% 
  filter(id %in% c("Slice0001", "Slice0100", "Slice1000")) %>% 
  unnest(value) %>% 
  ggplot(aes(x = date, y = log_volume, color = name, group = NA)) + 
  geom_point() + 
  geom_line() +
  facet_wrap(~id, scales = "free")

sliding_period(NYSE_training %>% arrange(date), 
               date, period = "month", lookback = Inf, assess_stop = 1) %>% 
  mutate(train_data = map(splits, analysis),
         test_data = map(splits, assessment)) %>% 
  select(-splits) %>% 
  pivot_longer(-id) %>% 
  filter(id %in% c("Slice001", "Slice050", "Slice100")) %>% 
  unnest(value) %>% 
  ggplot(aes(x = date, y = log_volume, color = name, group = NA)) + 
  geom_point() +
  geom_line() + 
  facet_wrap(~id, scales = "free")

elastic_net_recipe = recipe(data = NYSE_training, log_volume ~ .) %>%
  step_rm(day_of_week) %>%
  step_lag(DJ_return, log_volatility, lag = 1:5) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric_predictors(), -all_outcomes()) %>%
  step_naomit(all_predictors()) %>%
  prep(data = NYSE_training)
elastic_net_model = linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")
elastic_net_workflow = workflow() %>%
  add_model(elastic_net_model) %>%
  add_recipe(elastic_net_recipe %>% step_rm(date) %>% step_indicate_na())
folds = NYSE_training %>% 
  arrange(date) %>%
  rolling_origin(initial = 30, assess = 5)
month_folds = NYSE_training %>%
  sliding_period(date, "month", lookback = Inf, skip = 4)
elastic_net_grid = grid_regular(penalty(range = c(-7, -3), 
                                        log10_trans()), 
                                mixture(), levels = 5)

elastic_net_grid
# A tibble: 25 × 2
     penalty mixture
       <dbl>   <dbl>
 1 0.0000001    0   
 2 0.000001     0   
 3 0.00001      0   
 4 0.0001       0   
 5 0.001        0   
 6 0.0000001    0.25
 7 0.000001     0.25
 8 0.00001      0.25
 9 0.0001       0.25
10 0.001        0.25
# ℹ 15 more rows
elastic_net_fit = tune_grid(elastic_net_workflow, resamples = month_folds, 
                            grid = elastic_net_grid)
elastic_net_fit %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  ggplot(mapping = aes(x = penalty, y = mean, color = factor(mixture))) + 
  geom_point() +
  labs(x = "Penalty", y = "CV RSQ")

best_elastic_net = elastic_net_fit %>% select_best(metric = "rmse")

best_elastic_net
# A tibble: 1 × 3
    penalty mixture .config              
      <dbl>   <dbl> <chr>                
1 0.0000001       1 Preprocessor1_Model21
# CV R2

elastic_net_fit %>% 
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  filter(penalty == best_elastic_net$penalty & 
           mixture == best_elastic_net$mixture)
# A tibble: 1 × 8
    penalty mixture .metric .estimator  mean     n std_err .config              
      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1 0.0000001       1 rsq     standard   0.260   200  0.0155 Preprocessor1_Model21
elastic_net_final_workflow = elastic_net_workflow %>%
  finalize_workflow(best_elastic_net)
final_elastic_model = elastic_net_final_workflow %>%
  fit(NYSE_training)

test_predictions = final_elastic_model %>% predict(new_data = NYSE_test) %>%
  bind_cols(NYSE_test)

evaluation_metrics = test_predictions %>%
  metrics(truth = log_volume, estimate = .pred)

evaluation_metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.213
2 rsq     standard       0.215
3 mae     standard       0.161

1.4 Random forest forecaster (30pts)

  • Use the same features as in AR(\(L\)) for the random forest. Tune the random forest and evaluate the test performance.

  • Hint: Workflow: Random Forest for Prediction is a good starting point.

random_forest_recipe = recipe(data = NYSE_training, 
                         log_volume ~ .) %>%
  step_rm(day_of_week) %>%
  step_lag(DJ_return, log_volatility, lag = 1:5) %>%
  step_naomit(all_predictors()) %>%
  step_zv(all_numeric_predictors()) %>%
  prep(data = NYSE_training)
random_forest_model = rand_forest(mode = "regression",
                                  mtry = tune(), 
                                  trees = tune()) %>%
  set_engine("ranger")
random_forest_workflow = workflow() %>%
  add_model(random_forest_model) %>%
  add_recipe(random_forest_recipe %>% step_rm(date) %>% step_indicate_na())
random_forest_grid = grid_regular(trees(range = c(75L, 250L)), 
                                  mtry(range = c(1L, 5L)),
                                  levels = c(3,5))
random_forest_fit = tune_grid(random_forest_workflow, resamples = month_folds, 
                            grid = random_forest_grid)
random_forest_fit %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  ggplot(mapping = aes(x = trees, y = mean, color = factor(mtry))) + 
  geom_point() +
  labs(x = "Number of Trees", y = "RSQ")

best_random_forest = random_forest_fit %>% select_best(metric = "rsq")

best_random_forest
# A tibble: 1 × 3
   mtry trees .config              
  <int> <int> <chr>                
1     4   250 Preprocessor1_Model12
# CV R2

random_forest_fit %>% 
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  filter(trees == best_random_forest$trees & 
           mtry == best_random_forest$mtry)
# A tibble: 1 × 8
   mtry trees .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     4   250 rsq     standard   0.200   200  0.0126 Preprocessor1_Model12
random_forest_final_workflow = random_forest_workflow %>%
  finalize_workflow(best_random_forest)
final_random_forest_model = random_forest_final_workflow %>%
  fit(NYSE_training)

test_predictions = final_random_forest_model %>% 
  predict(new_data = NYSE_test) %>%
  bind_cols(NYSE_test)

evaluation_metrics = test_predictions %>%
  metrics(truth = log_volume, estimate = .pred)

evaluation_metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.218
2 rsq     standard       0.189
3 mae     standard       0.165

1.5 Boosting forecaster (30pts)

  • Use the same features as in AR(\(L\)) for the boosting. Tune the boosting algorithm and evaluate the test performance.

  • Hint: Workflow: Boosting tree for Prediction is a good starting point.
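
For intuition (a toy base-R sketch only, not what xgboost literally does internally), gradient boosting for regression repeatedly fits a weak learner to the current residuals and adds a shrunken copy of its fit; here the weak learner is a hand-rolled one-split “stump”:

```r
# One-split "stump": pick the decile threshold of x that best fits the
# residuals r with two group means, and return the fitted values.
stump_fit <- function(x, r) {
  cuts <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
  best <- NULL; best_sse <- Inf
  for (thr in cuts) {
    f <- ifelse(x <= thr, mean(r[x <= thr]), mean(r[x > thr]))
    sse <- sum((r - f)^2)
    if (sse < best_sse) { best_sse <- sse; best <- f }
  }
  best
}
set.seed(3)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)
pred <- rep(mean(y), length(y))
for (b in 1:300) pred <- pred + 0.1 * stump_fit(x, y - pred)  # learn_rate 0.1
1 - sum((y - pred)^2) / sum((y - mean(y))^2)  # in-sample R^2, fairly high
```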

boosting_recipe = recipe(data = NYSE_training, 
                         log_volume ~ .) %>%
  step_rm(day_of_week) %>%
  step_lag(DJ_return, log_volatility, lag = 1:5) %>%
  step_naomit(all_predictors()) %>%
  step_zv(all_numeric_predictors()) %>%
  prep(data = NYSE_training)
boosting_model = boost_tree(mode = "regression",
                            trees = 1000,
                            tree_depth = tune(),
                            learn_rate = tune()) %>%
  set_engine("xgboost")
boosting_workflow = workflow() %>%
  add_model(boosting_model) %>%
  add_recipe(boosting_recipe %>% step_rm(date) %>% step_indicate_na())
boosting_grid = grid_regular(tree_depth(range = c(1L, 5L)), 
                                  learn_rate(range = c(-3, -0.1),
                                             trans = log10_trans()),
                                  levels = c(3,5))
boosting_fit = tune_grid(boosting_workflow, resamples = month_folds, 
                            grid = boosting_grid)
boosting_fit %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  ggplot(mapping = aes(x = learn_rate, y = mean, color = factor(tree_depth))) + 
  geom_point() +
  labs(x = "Learn Rate", y = "RSQ")

best_boosting = boosting_fit %>% select_best(metric = "rsq")

best_boosting
# A tibble: 1 × 3
  tree_depth learn_rate .config              
       <int>      <dbl> <chr>                
1          1    0.00531 Preprocessor1_Model04
# CV R2

boosting_fit %>% 
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  filter(tree_depth == best_boosting$tree_depth & 
           learn_rate == best_boosting$learn_rate)
# A tibble: 1 × 8
  tree_depth learn_rate .metric .estimator  mean     n std_err .config          
       <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            
1          1    0.00531 rsq     standard   0.200   200  0.0128 Preprocessor1_Mo…
boosting_final_workflow = boosting_workflow %>%
  finalize_workflow(best_boosting)
final_boosting_model = boosting_final_workflow %>%
  fit(NYSE_training)

test_predictions = final_boosting_model %>% 
  predict(new_data = NYSE_test) %>%
  bind_cols(NYSE_test)

evaluation_metrics = test_predictions %>%
  metrics(truth = log_volume, estimate = .pred)

evaluation_metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.222
2 rsq     standard       0.174
3 mae     standard       0.167

1.6 Summary (30pts)

Your score for this question is largely determined by your final test performance.

Summarize the performance of different machine learning forecasters in the following format.

Method          CV \(R^2\)   Test \(R^2\)
Baseline        NA           0.35
AR(5)           0.26         0.22
Random Forest   0.20         0.19
Boosting        0.20         0.17

The baseline had a test \(R^2\) of 0.35 (no CV \(R^2\), since nothing is tuned). The AR(5) method gave a CV \(R^2\) of 0.26 with a test \(R^2\) of 0.22. The Random Forest method had a CV \(R^2\) of 0.20 and a test \(R^2\) of 0.19. The Boosting method had a CV \(R^2\) of 0.20 and a test \(R^2\) of 0.17.

From this, the baseline method appears to provide the best test performance of the four approaches. However, none of these methods does a satisfactory job of predicting “log_volume”, and I would not rely on any of them for dependable predictions. Boosting performed worst on the test set, even though its CV \(R^2\) was close to the CV \(R^2\) values for AR(5) and Random Forest.

2 ISL Exercise 12.6.13 (90 pts)

2.1 12.6.13 (b) (30 pts)

data = read_csv("../../slides/data/Ch12Ex13.csv", 
                col_names = paste("ID", 1:40, sep = ""))

head(data)
# A tibble: 6 × 40
      ID1    ID2    ID3    ID4    ID5    ID6     ID7     ID8     ID9   ID10
    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 -0.962   0.442 -0.975  1.42   0.819  0.316 -0.0250 -0.0640  0.0315 -0.350
2 -0.293  -1.14   0.196 -1.28  -0.251  2.51  -0.922   0.0595 -1.41   -0.657
3  0.259  -0.973  0.588 -0.800 -1.82  -2.06  -0.0648  1.59   -0.173  -0.121
4 -1.15   -2.21  -0.862  0.631  0.952 -1.17  -0.392   1.06   -0.350  -1.49 
5  0.196   0.593  0.283  0.247  1.98  -0.871 -0.990  -1.03   -1.11   -0.385
6  0.0301 -0.691 -0.403 -0.730 -0.364  1.13  -1.40   -0.806  -1.24    0.578
# ℹ 30 more variables: ID11 <dbl>, ID12 <dbl>, ID13 <dbl>, ID14 <dbl>,
#   ID15 <dbl>, ID16 <dbl>, ID17 <dbl>, ID18 <dbl>, ID19 <dbl>, ID20 <dbl>,
#   ID21 <dbl>, ID22 <dbl>, ID23 <dbl>, ID24 <dbl>, ID25 <dbl>, ID26 <dbl>,
#   ID27 <dbl>, ID28 <dbl>, ID29 <dbl>, ID30 <dbl>, ID31 <dbl>, ID32 <dbl>,
#   ID33 <dbl>, ID34 <dbl>, ID35 <dbl>, ID36 <dbl>, ID37 <dbl>, ID38 <dbl>,
#   ID39 <dbl>, ID40 <dbl>
linkage1 = hclust(as.dist(1 - cor(data)), method = "complete")

plot(linkage1, main = "Cluster Dendrogram with Complete Linkage")

linkage2 = hclust(as.dist(1 - cor(data)), method = "single")

plot(linkage2, main = "Cluster Dendrogram with Single Linkage")

linkage3 = hclust(as.dist(1 - cor(data)), method = "average")

plot(linkage3, main = "Cluster Dendrogram with Average Linkage")

Yes, the genes separate the samples into two groups for most of the linkages. The only dendrogram that appears to show three groups is the one with average linkage; the single and complete linkages both separate the samples into two main groups. Complete linkage appears to give the cleanest separation in the dendrogram plots.
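
To turn a dendrogram into explicit group labels, `cutree()` can cut the tree at \(k = 2\); the sketch below uses simulated stand-in data (so it is self-contained) rather than the actual expression matrix:

```r
# Toy stand-in for `data`: 1000 genes x 40 samples, with 100 genes shifted
# in opposite directions for the two sample groups.
set.seed(13)
toy <- matrix(rnorm(1000 * 40), nrow = 1000)
toy[1:100, 1:20]  <- toy[1:100, 1:20]  + 3   # "healthy" gene signature
toy[1:100, 21:40] <- toy[1:100, 21:40] - 3   # "diseased" gene signature
hc <- hclust(as.dist(1 - cor(toy)), method = "complete")
groups <- cutree(hc, k = 2)
table(groups)   # two clusters of 20 samples each
```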

2.2 PCA and UMAP (30 pts)

PCA:

pca_values = prcomp(data, scale = TRUE)
pca_recipe = recipe(~., data = data) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

pca_prep = prep(pca_recipe)
juice(pca_prep) %>%
  ggplot(aes(PC1, PC2)) +
  geom_point(alpha = 0.7, size = 2) +
  labs(color = NULL)

library(embed)
umap_recipe = recipe(~., data = data) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors())

umap_prep = prep(umap_recipe)
juice(umap_prep) %>%
  ggplot(aes(UMAP1, UMAP2)) +
  geom_point(aes(), alpha = 1, size = 4) + 
  labs(color = NULL)

For both PCA and UMAP, there appear to be two separate clusters, which I presume correspond to the “healthy” and “diseased” groups. One cluster also appears to be denser than the other.
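
As a sanity check on what `step_pca`/`prcomp` compute (toy matrix, illustrative only): the PC scores are the centered-and-scaled data multiplied by the rotation (loadings) matrix.

```r
# Verify that prcomp's scores equal scaled data %*% rotation.
set.seed(2)
X  <- matrix(rnorm(20 * 5), nrow = 20)
p  <- prcomp(X, center = TRUE, scale. = TRUE)
Xs <- scale(X)                        # same centering/scaling as prcomp
max(abs(Xs %*% p$rotation - p$x))     # essentially zero
```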

2.3 12.6.13 (c) (30 pts)

label = c(rep("healthy", 20), rep("diseased", 20))

p_values = apply(data, 1, function(x) {
  t.test(x[label == "healthy"], x[label == "diseased"])$p.value
})
adjusted_p_values = p.adjust(p_values, method = "BH") 

significant_p_values = which(adjusted_p_values < 0.05)

significant_p_values
  [1]  11  12  13  14  15  16  17  18  19  20 135 156 172 448 501 502 503 504
 [19] 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522
 [37] 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540
 [55] 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558
 [73] 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576
 [91] 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594
[109] 595 596 597 598 599 600 615 853 967

One way to determine which genes differ the most across the two groups is multiple testing, as done above. Since the first 20 columns are the “healthy” samples and the remaining 20 are the “diseased” samples, we can run a t-test on each gene (row) to obtain a p-value. To control the false discovery rate, the Benjamini-Hochberg method is used to adjust the p-values. The indices of the genes whose adjusted p-values fall below the significance level of 0.05, listed above, identify the genes that differ the most across the two groups.
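
As a cross-check on the Benjamini-Hochberg step (toy p-values; `bh()` is a hand-rolled illustration, not a replacement for `p.adjust`):

```r
# Hand-rolled BH adjustment: sort, scale p_(i) by m/i, take the running
# minimum from the largest p downward, cap at 1, and undo the sort.
bh <- function(p) {
  m <- length(p)
  o <- order(p)
  adj <- rev(cummin(rev(p[o] * m / seq_len(m))))  # min over larger p-values
  pmin(1, adj)[order(o)]                          # undo the sort, cap at 1
}
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.590)
all.equal(bh(p), p.adjust(p, method = "BH"))   # TRUE
```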